Robotics 24
☆ SAM 2 in Robotic Surgery: An Empirical Evaluation for Robustness and Generalization in Surgical Video Segmentation
The recent Segment Anything Model (SAM) 2 has demonstrated remarkable
foundational competence in semantic segmentation, with its memory mechanism and
mask decoder further addressing challenges in video tracking and object
occlusion, thereby achieving superior results in interactive segmentation for
both images and videos. Building upon our previous empirical studies, we
further explore the zero-shot segmentation performance of SAM 2 in
robot-assisted surgery based on prompts, alongside its robustness against
real-world corruption. For static images, we employ two forms of prompts:
1-point and bounding box, while for video sequences, the 1-point prompt is
applied to the initial frame. Through extensive experimentation on the MICCAI
EndoVis 2017 and EndoVis 2018 benchmarks, SAM 2, when utilizing bounding box
prompts, outperforms state-of-the-art (SOTA) methods in comparative
evaluations. The results with point prompts also exhibit a substantial
enhancement over SAM's capabilities, nearing or even surpassing existing
unprompted SOTA methodologies. In addition, SAM 2 demonstrates improved
inference speed and less performance degradation under various image
corruptions.
Although results remain slightly unsatisfactory at certain edges and regions,
SAM 2's robust adaptability to 1-point prompts underscores its potential for
downstream surgical tasks with limited prompt requirements.
comment: Empirical study. Previous work "SAM Meets Robotic Surgery" is
accessible at: arXiv:2308.07156
☆ FORGE: Force-Guided Exploration for Robust Contact-Rich Manipulation under Uncertainty
Michael Noseworthy, Bingjie Tang, Bowen Wen, Ankur Handa, Nicholas Roy, Dieter Fox, Fabio Ramos, Yashraj Narang, Iretiayo Akinola
We present FORGE, a method that enables sim-to-real transfer of contact-rich
manipulation policies in the presence of significant pose uncertainty. FORGE
combines a force threshold mechanism with a dynamics randomization scheme
during policy learning in simulation, to enable the robust transfer of the
learned policies to the real robot. At deployment, FORGE policies, conditioned
on a maximum allowable force, adaptively perform contact-rich tasks while
respecting the specified force threshold, regardless of the controller gains.
Additionally, FORGE autonomously predicts a termination action once the task
has succeeded. We demonstrate that FORGE can be used to learn a variety of
robust contact-rich policies, enabling multi-stage assembly of a planetary gear
system, which requires success across three assembly tasks: nut-threading,
insertion, and gear meshing. The project website can be accessed at
https://noseworm.github.io/forge/.
☆ A Learning-Based Model Predictive Contouring Control for Vehicle Evasive Manoeuvres
This paper presents a novel Learning-based Model Predictive Contouring
Control (L-MPCC) algorithm for evasive manoeuvres at the limit of handling. The
algorithm uses the Student-t Process (STP) to minimise model mismatches and
uncertainties online. The proposed STP captures the mismatches between the
prediction model and the measured lateral tyre forces and yaw rate. The
mismatches correspond to the posterior means provided to the prediction model
to improve its accuracy. Simultaneously, the posterior covariances are
propagated to the vehicle lateral velocity and yaw rate along the prediction
horizon. The STP posterior covariance directly depends on the variance of
observed data, so its variance is more significant when the online measurements
differ from the recorded ones in the training set and smaller in the opposite
case. Thus, these covariances can be utilised in the L-MPCC's cost function to
minimise the vehicle state uncertainties. In a high-fidelity simulation
environment, we demonstrate that the proposed L-MPCC can successfully avoid
obstacles, keeping the vehicle stable while performing a double-lane-change
manoeuvre at a higher velocity than an MPCC without STP. Furthermore, the
proposed controller yields a significantly lower peak sideslip angle, improving
the vehicle's manoeuvrability compared to an L-MPCC with a Gaussian Process.
comment: The work will be presented at AVEC'24 in Milan
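For illustration, the variance behaviour described in the abstract can be seen in a minimal Student-t process regression sketch. This is not the paper's implementation; the RBF kernel, zero prior mean, and all names and hyperparameters are illustrative assumptions. The key property shown is that predictive variance stays small near the training data and grows away from it.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel on 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def stp_posterior(x_train, y_train, x_query, nu=5.0, jitter=1e-6):
    """Student-t process posterior (zero prior mean).

    The posterior mean matches the GP posterior; the covariance is the GP
    covariance rescaled by (nu + beta - 2) / (nu + n - 2), beta = y^T K^-1 y,
    so the spread of the observed data inflates or deflates the uncertainty.
    """
    n = len(x_train)
    K = rbf(x_train, x_train) + jitter * np.eye(n)
    k_star = rbf(x_train, x_query)                     # shape (n, m)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    beta = float(y_train @ np.linalg.solve(K, y_train))
    gp_var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    var = (nu + beta - 2.0) / (nu + n - 2.0) * gp_var
    return mean, var

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
mean, var = stp_posterior(x_train, y_train, np.array([1.0, 6.0]))
# variance at the seen point (x = 1) is far smaller than at the unseen x = 6
```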
☆ SegXAL: Explainable Active Learning for Semantic Segmentation in Driving Scene Scenarios ICPR
Most sophisticated AI models utilize huge amounts of annotated data and heavy
training to achieve high-end performance. However, certain challenges hinder
the deployment of AI models in "in-the-wild" scenarios,
i.e., inefficient use of unlabeled data, lack of incorporation of human
expertise, and lack of interpretation of the results. To mitigate these
challenges, we propose a novel Explainable Active Learning (XAL)-based
semantic segmentation model, "SegXAL", that can (i) effectively
utilize the unlabeled data, (ii) facilitate the "Human-in-the-loop" paradigm,
and (iii) augment the model decisions in an interpretable way. In particular,
we investigate the application of the SegXAL model for semantic segmentation in
driving scene scenarios. The SegXAL model proposes the image regions that
require labeling assistance from an oracle using explainable AI (XAI) and
uncertainty measures in a weakly-supervised manner. Specifically, we propose a
novel Proximity-aware Explainable-AI (PAE) module and Entropy-based Uncertainty
(EBU) module to get an Explainable Error Mask, which enables the machine
teachers/human experts to provide intuitive reasoning behind the results and to
solicit feedback to the AI system via an active learning strategy. Such a
mechanism bridges the semantic gap between man and machine through
collaborative intelligence, where humans and AI actively enhance each other's
complementary strengths. A novel high-confidence sample selection technique
based on the DICE similarity coefficient is also presented within the SegXAL
framework. Extensive quantitative and qualitative analyses are carried out on
the benchmark Cityscapes dataset. Results show that our proposed SegXAL
outperforms other state-of-the-art models.
comment: 17 pages, 7 figures. To appear in the proceedings of the 27th
International Conference on Pattern Recognition (ICPR), 01-05 December, 2024,
Kolkata, India
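For reference, the DICE similarity coefficient underlying SegXAL's high-confidence sample selection has a one-line definition over binary masks. The sketch below is illustrative and not taken from the authors' code:

```python
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """DICE = 2|A & B| / (|A| + |B|) for binary segmentation masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# identical masks score 1.0; partially overlapping masks score in between
m1 = np.array([[1, 1], [0, 0]])
m2 = np.array([[1, 0], [0, 0]])
```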
☆ A Review of 3D Reconstruction Techniques for Deformable Tissues in Robotic Surgery MICCAI 2024
As a crucial and intricate task in robotic minimally invasive surgery,
reconstructing surgical scenes using stereo or monocular endoscopic video holds
immense potential for clinical applications. NeRF-based techniques have
recently garnered attention for their ability to reconstruct scenes implicitly.
On the other hand, Gaussian splatting-based 3D-GS represents scenes explicitly
using 3D Gaussians and projects them onto a 2D plane as a replacement for the
complex volume rendering in NeRF. However, these methods face challenges
regarding surgical scene reconstruction, such as slow inference, dynamic
scenes, and surgical tool occlusion. This work explores and reviews
state-of-the-art (SOTA) approaches, discussing their innovations and
implementation principles. Furthermore, we replicate the models and conduct
testing and evaluation on two datasets. The test results demonstrate that with
advancements in these techniques, achieving real-time, high-quality
reconstructions becomes feasible.
comment: To appear in MICCAI 2024 EARTH Workshop. Code availability:
https://github.com/Epsilon404/surgicalnerf
☆ UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation
Smart autonomous agents are becoming increasingly important in various
real-life applications, including robotics and autonomous vehicles. One crucial
skill that these agents must possess is the ability to interact with their
surrounding entities, such as other agents or humans. In this work, we aim to
build an intelligent agent that can efficiently navigate in an environment
while being able to interact with an oracle (or human) in natural language and
ask for directions when it is unsure about its navigation performance. The
interaction is initiated by the agent, which produces a question that is then
answered by the oracle based on the shortest trajectory to the goal. The
process can be performed multiple times during navigation, thus enabling the
agent to hold a dialogue with the oracle. To this end, we propose a novel
computational model, named UNMuTe, that consists of two main components: a
dialogue model and a navigator. Specifically, the dialogue model is based on a
GPT-2 decoder that handles multimodal data consisting of both text and images.
First, the dialogue model is trained to generate question-answer pairs: the
question is generated using the current image, while the answer is produced
leveraging future images on the path toward the goal. Subsequently, a VLN model
is trained to follow the dialogue predicting navigation actions or triggering
the dialogue model if it needs help. In our experimental analysis, we show that
UNMuTe achieves state-of-the-art performance on the main navigation tasks
involving dialogue, i.e., Cooperative Vision and Dialogue Navigation (CVDN) and
Navigation from Dialogue History (NDH), proving that our approach is effective
in generating useful questions and answers to guide navigation.
☆ Deep Generative Models in Robotics: A Survey on Learning from Multimodal Demonstrations
Julen Urain, Ajay Mandlekar, Yilun Du, Mahi Shafiullah, Danfei Xu, Katerina Fragkiadaki, Georgia Chalvatzaki, Jan Peters
Learning from Demonstrations, the field that proposes to learn robot behavior
models from data, is gaining popularity with the emergence of deep generative
models. Although the problem has been studied for years under names such as
Imitation Learning, Behavioral Cloning, or Inverse Reinforcement Learning,
classical methods have relied on models that don't capture complex data
distributions well or don't scale well to large numbers of demonstrations. In
recent years, the robot learning community has shown increasing interest in
using deep generative models to capture the complexity of large datasets. In
this survey, we aim to provide a unified and comprehensive review of the last
year's progress in the use of deep generative models in robotics. We present
the different types of models that the community has explored, such as
energy-based models, diffusion models, action value maps, or generative
adversarial networks. We also present the different types of applications in
which deep generative models have been used, from grasp generation to
trajectory generation or cost learning. One of the most important aspects of
generative models is out-of-distribution generalization. In our survey, we
review the different decisions the community has made to improve the
generalization of the learned models. Finally, we highlight the research
challenges and propose a number of future directions for learning deep
generative models in robotics.
comment: 20 pages, 11 figures, submitted to TRO
☆ Assigning Credit with Partial Reward Decoupling in Multi-Agent Proximal Policy Optimization
Multi-agent proximal policy optimization (MAPPO) has recently demonstrated
state-of-the-art performance on challenging multi-agent reinforcement learning
tasks. However, MAPPO still struggles with the credit assignment problem,
wherein the difficulty of ascribing credit to individual agents' actions grows
sharply with team size. In this paper, we propose a multi-agent
reinforcement learning algorithm that adapts recent developments in credit
assignment to improve upon MAPPO. Our approach leverages partial reward
decoupling (PRD), which uses a learned attention mechanism to estimate which of
a particular agent's teammates are relevant to its learning updates. We use
this estimate to dynamically decompose large groups of agents into smaller,
more manageable subgroups. We empirically demonstrate that our approach,
PRD-MAPPO, decouples agents from teammates that do not influence their expected
future reward, thereby streamlining credit assignment. We additionally show
that PRD-MAPPO yields significantly higher data efficiency and asymptotic
performance compared to both MAPPO and other state-of-the-art methods across
several multi-agent tasks, including StarCraft II. Finally, we propose a
version of PRD-MAPPO that is applicable to \textit{shared} reward settings,
where PRD was previously not applicable, and empirically show that this also
leads to performance improvements over MAPPO.
comment: 20 pages, 5 figures, 12 tables, Reinforcement Learning Journal and
Reinforcement Learning Conference 2024
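The grouping step described above can be pictured as thresholding softmax attention weights over teammates. In PRD-MAPPO these weights come from a learned attention module; the sketch below fixes them by hand, and the function names and threshold are illustrative only:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def relevant_teammates(attn_logits, eps=0.1):
    """Return indices of teammates whose attention weight exceeds eps,
    forming the smaller subgroup used for an agent's credit assignment."""
    w = softmax(np.asarray(attn_logits, dtype=float))
    return [i for i, wi in enumerate(w) if wi > eps]

# this agent attends strongly to teammates 0 and 2, weakly to 1 and 3
subgroup = relevant_teammates([3.0, 0.0, 2.5, -1.0])
```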
☆ BPMP-Tracker: A Versatile Aerial Target Tracker Using Bernstein Polynomial Motion Primitives
This letter presents a versatile trajectory planning pipeline for aerial
tracking. The proposed tracker is capable of handling various chasing settings
such as complex unstructured environments, crowded dynamic obstacles and
multiple-target following. Among the entire pipeline, we focus on developing a
predictor for future target motion and a chasing trajectory planner. For rapid
computation, we employ the sample-check-select strategy: modules sample a set
of candidate movements, check multiple constraints, and then select the best
trajectory. Also, we leverage the properties of Bernstein polynomials for quick
calculations. The prediction module forecasts target trajectories that do not
overlap with static or dynamic obstacles. The trajectory planner then outputs a
trajectory that satisfies various conditions such as occlusion and collision
avoidance, the visibility of all targets within the camera image, and dynamical
limits. We fully test the proposed tracker in simulations and
hardware experiments under challenging scenarios, including dual-target
following, environments with dozens of dynamic obstacles and complex indoor and
outdoor spaces.
comment: 8 pages, 9 figures
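The quick-evaluation property of Bernstein polynomials comes from De Casteljau's recursion, which evaluates the curve by repeated linear interpolation. A one-dimensional sketch follows; the planner itself works with vector-valued control points and additional constraints:

```python
def de_casteljau(ctrl, t):
    """Evaluate a Bernstein (Bezier) curve with control points `ctrl` at
    t in [0, 1] by repeated linear interpolation of adjacent points."""
    pts = list(ctrl)
    while len(pts) > 1:
        pts = [(1.0 - t) * p + t * q for p, q in zip(pts, pts[1:])]
    return pts[0]

# endpoints interpolate the first and last control points, and by the
# convex-hull property values stay within [min(ctrl), max(ctrl)]
mid = de_casteljau([0.0, 1.0, 2.0, 3.0], 0.5)
```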
☆ Temporal Logic Planning via Zero-Shot Policy Composition
This work develops a zero-shot mechanism for an agent to satisfy a Linear
Temporal Logic (LTL) specification given existing task primitives. Oftentimes,
autonomous robots need to satisfy spatial and temporal goals that are unknown
until run time. Prior research addresses the problem by learning policies that
are capable of executing a high-level task specified using LTL, but they
incorporate the specification into the learning process; therefore, any change
to the specification requires retraining the policy. Other related research
addresses the problem by creating skill-machines which, given a specification
change, do not require full policy retraining but require fine-tuning on the
skill-machine to guarantee satisfaction. We present a more flexible approach
-- learning a set of minimum-violation (MV) task primitive policies that can be
used to satisfy arbitrary LTL specifications without retraining or fine-tuning.
Task primitives can be learned offline using reinforcement learning (RL)
methods and combined using Boolean composition at deployment. This work focuses
on creating and pruning a transition system (TS) representation of the
environment in order to solve for deterministic, non-ambiguous, and feasible
solutions to LTL specifications given an environment and a set of MV task
primitive policies. We show that our pruned TS is deterministic, contains no
unrealizable transitions, and is sound. Through simulation, we show that our
approach is executable and we verify our MV policies produce the expected
symbols.
comment: 16 pages, 11 figures
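Boolean composition of primitives at deployment can be sketched as pointwise min/max over value functions: conjunction as min, disjunction as max. This is a common construction in composable RL, shown here with hand-written toy values rather than the paper's MV policies:

```python
def q_and(q1, q2):
    """Conjunction: a state-action pair is good only if good for both tasks."""
    return lambda s, a: min(q1(s, a), q2(s, a))

def q_or(q1, q2):
    """Disjunction: good if good for either task."""
    return lambda s, a: max(q1(s, a), q2(s, a))

# toy goal-reaching values on a line (higher is better, peak at the goal)
q_goal_a = lambda s, a: -abs((s + a) - 2.0)   # primitive: reach x = 2
q_goal_b = lambda s, a: -abs((s + a) - 5.0)   # primitive: reach x = 5

q_either = q_or(q_goal_a, q_goal_b)
best = max([-1.0, 0.0, 1.0], key=lambda a: q_either(4.0, a))
# from s = 4, the greedy action under the composed value steps to x = 5
```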
☆ Koopman Operators in Robot Learning
Lu Shi, Masih Haseli, Giorgos Mamakoukas, Daniel Bruder, Ian Abraham, Todd Murphey, Jorge Cortes, Konstantinos Karydis
Koopman operator theory offers a rigorous treatment of dynamics and has been
emerging as a powerful modeling and learning-based control method enabling
significant advancements across various domains of robotics. Due to its ability
to represent nonlinear dynamics as a linear operator, Koopman theory offers a
fresh lens through which to understand and tackle the modeling and control of
complex robotic systems. Moreover, it enables incremental updates and is
computationally inexpensive, making it particularly appealing for real-time
applications and online active learning. This review comprehensively presents
recent research results on advancing Koopman operator theory across diverse
domains of robotics, encompassing aerial, legged, wheeled, underwater, soft,
and manipulator robotics. Furthermore, it offers practical tutorials to help
new users get started as well as a treatise of more advanced topics leading to
an outlook on future directions and open research questions. Taken together,
these provide insights into the potential evolution of Koopman theory as
applied to the field of robotics.
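A minimal instance of learning a Koopman model is extended dynamic mode decomposition (EDMD): lift states through a dictionary of observables and fit the linear operator by least squares. The sketch below is generic, not drawn from any surveyed work; with an identity lift on data from a linear system, it recovers the system matrix exactly.

```python
import numpy as np

def edmd(X, Y, lift):
    """Fit a finite-dimensional Koopman approximation K by least squares:
    lift(Y) ~= lift(X) @ K, where row i of Y is the successor of row i of X."""
    PX = np.array([lift(x) for x in X])
    PY = np.array([lift(y) for y in Y])
    K, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K

# data from a linear system x_next = A x; identity lift recovers A (transposed)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = X @ A.T
K = edmd(X, Y, lift=lambda x: x)
```

The least-squares fit admits cheap incremental updates as new data arrive, which is one reason Koopman methods suit the online, real-time settings the review highlights.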
☆ F1tenth Autonomous Racing With Offline Reinforcement Learning Methods
Autonomous racing serves as a critical platform for evaluating automated
driving systems and enhancing vehicle mobility intelligence. This work
investigates offline reinforcement learning methods to train agents within the
dynamic F1tenth racing environment. The study begins by exploring the
challenges of online training in the Austria race track environment, where
agents consistently fail to complete the laps. Consequently, this research
pivots towards an offline strategy, leveraging an `expert' demonstration
dataset to facilitate agent training. A waypoint-based suboptimal controller is
developed to gather data with successful lap episodes. This data is then
employed to train offline learning-based algorithms, with a subsequent analysis
of the agents' cross-track performance, evaluating their zero-shot
transferability from seen to unseen scenarios and their capacity to adapt to
changes in environment dynamics. Beyond mere algorithm benchmarking in
autonomous racing scenarios, this study also introduces and describes the
machinery of our return-conditioned decision tree-based policy, comparing its
performance with methods that employ fully connected neural networks,
Transformers, and Diffusion Policies and highlighting some insights into method
selection for training autonomous agents in driving interactions.
☆ Design and Implementation of Smart Infrastructures and Connected Vehicles in A Mini-city Platform SC
This paper presents a 1/10th scale mini-city platform used as a testing bed
for evaluating autonomous and connected vehicles. Using the mini-city platform,
we can evaluate different driving scenarios including human-driven and
autonomous driving. We provide a unique, visual feature-rich environment for
evaluating computer vision methods. The conducted experiments utilize onboard
sensors mounted on a robotic platform we built, allowing them to navigate in a
controlled real-world urban environment. The designed city is occupied by cars,
stop signs, a variety of residential and business buildings, and complex
intersections mimicking an urban area. Furthermore, we have designed
intelligent infrastructure at one of the intersections in the city, which
enables safer and more efficient navigation in the presence of multiple cars
and pedestrians. We have used the mini-city platform for the analysis of three
different applications: city mapping, depth estimation in challenging occluded
environments, and smart infrastructure for connected vehicles. Our smart
infrastructure is among the first to develop and evaluate
Vehicle-to-Infrastructure (V2I) communication at intersections. The
intersection-related result shows how inaccuracy in perception, including
mapping and localization, can affect safety. The proposed mini-city platform
can be considered as a baseline environment for developing research and
education in intelligent transportation systems.
comment: 8 pages, 9 figures, Presented at 2024 IEEE ITSC Conference
☆ Everyday Finger: A Robotic Finger that Meets the Needs of Everyday Interactive Manipulation ICRA 2024
We provide the mechanical and dynamical requirements for a robotic finger
capable of performing thirty diverse everyday tasks. To match these
requirements, we present a finger design based on series-elastic actuation that
we call the everyday finger. Our focus is to make the fingers as compact as
possible while achieving the desired performance. We evaluated everyday fingers
by constructing a two-finger robotic hand that was tested on various
performance parameters and tasks such as picking and placing dishes in a rack,
picking thin, flat objects such as paper, and picking delicate objects such as
strawberries. Videos are available at the project website:
https://sites.google.com/view/everydayfinger.
comment: 9.5 pages + references, 14 figures, extended/updated version of
article to appear in IEEE ICRA 2024 proceedings
♻ ☆ State Representations as Incentives for Reinforcement Learning Agents: A Sim2Real Analysis on Robotic Grasping
Choosing an appropriate representation of the environment for the underlying
decision-making process of the reinforcement learning agent is not always
straightforward. The state representation should be inclusive enough to allow
the agent to informatively decide on its actions and disentangled enough to
simplify policy training and the corresponding sim2real transfer. Given this
outlook, this work examines the effect of various representations in
incentivizing the agent to solve a specific robotic task: antipodal and planar
object grasping. A continuum of state representations is defined, starting from
hand-crafted numerical states to encoded image-based representations, with
decreasing levels of induced task-specific knowledge. The effects of each
representation on the ability of the agent to solve the task in simulation and
the transferability of the learned policy to the real robot are examined and
compared against a model-based approach with complete system knowledge. The
results show that reinforcement learning agents using numerical states can
perform on par with non-learning baselines. Furthermore, we find that agents
using image-based representations from pre-trained environment embedding
vectors perform better than end-to-end trained agents, and hypothesize that
separation of representation learning from reinforcement learning can benefit
sim2real transfer. Finally, we conclude that incentivizing the state
representation with task-specific knowledge facilitates faster convergence for
agent training and increases success rates in sim2real robot control.
comment: Accepted to IEEE International Conference on Systems, Man, and
Cybernetics (SMC) 2024
♻ ☆ Conformal Temporal Logic Planning using Large Language Models
This paper addresses planning problems for mobile robots. We consider
missions that require accomplishing multiple high-level sub-tasks, expressed in
natural language (NL), in a temporal and logical order. To formally define the
mission, we treat these sub-tasks as atomic predicates in a Linear Temporal
Logic (LTL) formula. We refer to this task specification framework as LTL-NL.
Our goal is to design plans, defined as sequences of robot actions,
accomplishing LTL-NL tasks. This action planning problem cannot be solved
directly by existing LTL planners because of the NL nature of atomic
predicates. To address it, we propose HERACLEs, a hierarchical neuro-symbolic
planner that relies on a novel integration of (i) existing symbolic planners
generating high-level task plans determining the order at which the NL
sub-tasks should be accomplished; (ii) pre-trained Large Language Models (LLMs)
to design sequences of robot actions based on these task plans; and (iii)
conformal prediction acting as a formal interface between (i) and (ii) and
managing uncertainties due to LLM imperfections. We show, both theoretically
and empirically, that HERACLEs can achieve user-defined mission success rates.
Finally, we provide comparative experiments demonstrating that HERACLEs
outperforms LLM-based planners that require the mission to be defined solely
using NL. Additionally, we present examples demonstrating that our approach
enhances user-friendliness compared to conventional symbolic approaches.
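The conformal-prediction interface in (iii) can be sketched with the standard split-conformal threshold: calibrate a quantile of nonconformity scores, then keep every candidate whose score falls below it. The scores, alpha, and names below are made up for illustration and are not HERACLEs' actual calibration data:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: the ceil((n+1)(1-alpha))-th smallest calibration
    score; test candidates scoring <= this are kept, giving ~(1-alpha)
    marginal coverage under exchangeability."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))
    return np.sort(cal_scores)[min(k, n) - 1]

# e.g. nonconformity = 1 - model confidence in the correct next action
cal = np.array([0.05, 0.10, 0.20, 0.30, 0.50, 0.70, 0.15, 0.25, 0.35, 0.60])
q = conformal_threshold(cal, alpha=0.2)
keep = [s for s in (0.12, 0.65, 0.40) if s <= q]
```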
♻ ☆ Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning
In offline reinforcement learning (RL), an RL agent learns to solve a task
using only a fixed dataset of previously collected data. While offline RL has
been successful in learning real-world robot control policies, it typically
requires large amounts of expert-quality data to learn effective policies that
generalize to out-of-distribution states. Unfortunately, such data is often
difficult and expensive to acquire in real-world tasks. Several recent works
have leveraged data augmentation (DA) to inexpensively generate additional
data, but most DA works apply augmentations in a random fashion and ultimately
produce highly suboptimal augmented experience. In this work, we propose Guided
Data Augmentation (GuDA), a human-guided DA framework that generates
expert-quality augmented data. The key insight behind GuDA is that while it may
be difficult to demonstrate the sequence of actions required to produce expert
data, a user can often easily characterize when an augmented trajectory segment
represents progress toward task completion. Thus, a user can restrict the space
of possible augmentations to automatically reject suboptimal augmented data. To
extract a policy from GuDA, we use off-the-shelf offline reinforcement learning
and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer
task as well as simulated D4RL navigation tasks, a simulated autonomous driving
task, and a simulated soccer task. Empirically, GuDA enables learning given a
small initial dataset of potentially suboptimal experience and outperforms a
random DA strategy as well as a model-based DA strategy.
comment: RLC 2024
♻ ☆ Safety-Aware Human-Lead Vehicle Platooning by Proactively Reacting to Uncertain Human Behaving
Human-Lead Cooperative Adaptive Cruise Control (HL-CACC) is regarded as a
promising vehicle platooning technology in real-world implementation. By
utilizing a Human-driven Vehicle (HV) as the platoon leader, HL-CACC reduces
the cost and enhances the reliability of perception and decision-making.
However, state-of-the-art HL-CACC technology still has a major limitation on
driving safety, as it does not account for the leading human driver's uncertain
behavior. In this study, an HL-CACC controller is designed based on Stochastic
Model Predictive Control (SMPC). It is enabled to predict the driving intention
of the leading Connected Human-Driven Vehicle (CHV). The proposed controller
has the following features: i) enhanced perceived safety in oscillating
traffic; ii) guaranteed safety against hard brakes; iii) computational
efficiency for real-time implementation. The proposed controller is evaluated on
a PreScan&Simulink simulation platform. Real vehicle trajectory data is
collected for the calibration of simulation. Results reveal that the proposed
controller: i) improves perceived safety by 19.17% in oscillating traffic; ii)
enhances actual safety by 7.76% against hard brakes; iii) is confirmed to be
string stable. The computation time is approximately 3 milliseconds when
running on a laptop equipped with an Intel i5-13500H CPU. This indicates the
proposed controller is ready for real-time implementation.
♻ ☆ Automatic Target-Less Camera-LiDAR Calibration From Motion and Deep Point Correspondences
Sensor setups of robotic platforms commonly include both camera and LiDAR as
they provide complementary information. However, fusing these two modalities
typically requires a highly accurate calibration between them. In this paper,
we propose MDPCalib, a novel method for camera-LiDAR calibration that
requires neither human supervision nor any specific target objects. Instead, we
utilize sensor motion estimates from visual and LiDAR odometry as well as deep
learning-based 2D-pixel-to-3D-point correspondences that are obtained without
in-domain retraining. We represent camera-LiDAR calibration as an optimization
problem and minimize the costs induced by constraints from sensor motion and
point correspondences. In extensive experiments, we demonstrate that our
approach yields highly accurate extrinsic calibration parameters and is robust
to random initialization. Additionally, our approach generalizes to a wide
range of sensor setups, which we demonstrate by employing it on various robotic
platforms including a self-driving perception car, a quadruped robot, and a
UAV. To make our calibration method publicly accessible, we release the code on
our project website at http://calibration.cs.uni-freiburg.de.
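The simplest instance of a correspondence-induced cost is rigid alignment of matched 3D points, solvable in closed form via SVD (the Kabsch algorithm). This is only a stand-in for MDPCalib's joint optimization over motion and correspondence constraints, with illustrative synthetic data:

```python
import numpy as np

def kabsch(P, Q):
    """Best-fit rotation R and translation t with R @ P_i + t ~= Q_i,
    computed in closed form from the SVD of the cross-covariance."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

# recover a known extrinsic from noiseless synthetic correspondences
rng = np.random.default_rng(1)
P = rng.normal(size=(20, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -0.2, 1.0])
Q = P @ R_true.T + t_true
R, t = kabsch(P, Q)
```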
♻ ☆ Smooth Model Predictive Path Integral Control without Smoothing IROS 2022
We present a sampling-based control approach that can generate smooth actions
for general nonlinear systems without external smoothing algorithms. Model
Predictive Path Integral (MPPI) control has been utilized in numerous robotic
applications due to its appealing characteristics to solve non-convex
optimization problems. However, the stochastic nature of sampling-based methods
can cause significant chattering in the resulting commands. Chattering becomes
more prominent in cases where the environment changes rapidly, possibly even
causing the MPPI to diverge. To address this issue, we propose a method that
seamlessly combines MPPI with an input-lifting strategy. In addition, we
introduce a new action cost to smooth the control sequence during trajectory
rollouts while preserving the information theoretic interpretation of MPPI,
which was derived from non-affine dynamics. We validate our method in two
nonlinear control tasks with neural network dynamics: a pendulum swing-up task
and a challenging autonomous driving task. The experimental results demonstrate
that our method outperforms the MPPI baselines with additionally applied
smoothing algorithms.
comment: Accepted to IEEE Robotics and Automation Letters (and IROS 2022). Our
video can be found at https://youtu.be/fyngK8PCoyM
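For context, one iteration of vanilla MPPI (without the paper's input-lifting or smoothing cost) is a softmax-weighted average of sampled control perturbations; the chattering the paper targets enters through the raw noise in this average. The toy task, horizon, and hyperparameters below are arbitrary illustrations:

```python
import numpy as np

def mppi_step(u_nom, rollout_cost, n_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One information-theoretic MPPI update: perturb the nominal control
    sequence, weight each sample by exp(-cost / lambda), and average."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.normal(0.0, sigma, size=(n_samples, len(u_nom)))
    costs = np.array([rollout_cost(u_nom + e) for e in eps])
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()
    return u_nom + w @ eps          # weighted noise is where chattering enters

# toy task: accelerate a 1-D point mass from x = 0 toward x = 1
H, dt = 20, 0.1
def rollout_cost(u):
    x = v = cost = 0.0
    for a in u:
        v += a * dt
        x += v * dt
        cost += (x - 1.0) ** 2 + 1e-3 * a ** 2
    return cost

rng = np.random.default_rng(0)
u = np.zeros(H)
for _ in range(30):
    u = mppi_step(u, rollout_cost, rng=rng)
# after a few iterations the plan's cost drops well below the do-nothing plan
```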
♻ ☆ KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance
Online Imitation Learning methods struggle with the gap between the extensive
online exploration space and limited expert trajectories, which hinders
efficient exploration due to inaccurate task-aware reward estimation. Inspired
by the findings from cognitive neuroscience that task decomposition could
facilitate cognitive processing for efficient learning, we hypothesize that an
agent could estimate precise task-aware imitation rewards for efficient online
exploration by decomposing the target task into the objectives of "what to do"
and the mechanisms of "how to do it". In this work, we introduce the hybrid
Key-state guided Online Imitation (KOI) learning approach, which leverages the
integration of semantic and motion key states as guidance for task-aware reward
estimation. Initially, we utilize the visual-language models to segment the
expert trajectory into semantic key states, indicating the objectives of "what
to do". Within the intervals between semantic key states, optical flow is
employed to capture motion key states to understand the process of "how to do it".
By integrating a thorough grasp of both semantic and motion key states, we
refine the trajectory-matching reward computation, encouraging task-aware
exploration for efficient online imitation learning. Our experimental results
show that our method is more sample-efficient in the Meta-World and LIBERO
environments. We also conduct real-world robotic manipulation experiments to
validate the efficacy of our method, demonstrating the practical applicability
of our KOI method.
♻ ☆ MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models
Peng Ding, Jiading Fang, Peng Li, Kangrui Wang, Xiaochen Zhou, Mo Yu, Jing Li, Matthew R. Walter, Hongyuan Mei
Large language models such as ChatGPT and GPT-4 have recently achieved
astonishing performance on a variety of natural language processing tasks. In
this paper, we propose MANGO, a benchmark to evaluate their capabilities to
perform text-based mapping and navigation. Our benchmark includes 53 mazes
taken from a suite of text games: each maze is paired with a walkthrough that
visits every location but does not cover all possible paths. The task is
question-answering: for each maze, a large language model reads the walkthrough
and answers hundreds of mapping and navigation questions such as "How should
you go to Attic from West of House?" and "Where are we if we go north and east
from Cellar?". Although these questions are easy for humans, it turns out that
even GPT-4, the best language model to date, performs poorly at answering them.
Further, our experiments suggest that a strong mapping and navigation ability
would benefit large language models in performing relevant downstream tasks,
such as playing text games. Our MANGO benchmark will facilitate future research
on methods that improve the mapping and navigation capabilities of language
models. We host our leaderboard, data, code, and evaluation program at
https://mango.ttic.edu and https://github.com/oaklight/mango/.
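The two question types can be sketched over a tiny hand-made map (the edge table is a hypothetical fragment in the spirit of the locations named in the abstract, not MANGO's actual data): "Where are we if we go ... from X?" is action-following, and "How should you go to X from Y?" is shortest-path search.

```python
from collections import deque

# Hypothetical (place, action) -> place edges recovered from a walkthrough.
EDGES = {
    ("West of House", "north"): "North of House",
    ("North of House", "east"): "Behind House",
    ("Behind House", "west"): "Kitchen",
    ("Kitchen", "up"): "Attic",
    ("Kitchen", "down"): "Cellar",
    ("Cellar", "north"): "Troll Room",
}

def destination(start, actions):
    """'Where are we if we go ... from <start>?' -- follow the actions."""
    here = start
    for a in actions:
        here = EDGES.get((here, a))
        if here is None:
            return None  # action not possible from here
    return here

def route(start, goal):
    """'How should you go to <goal> from <start>?' -- BFS over actions."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        here, path = frontier.popleft()
        if here == goal:
            return path
        for (src, act), dst in EDGES.items():
            if src == here and dst not in seen:
                seen.add(dst)
                frontier.append((dst, path + [act]))
    return None
```

A model reading the walkthrough must implicitly build this graph from prose; the benchmark checks its answers against ground truth computed exactly as above.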
comment: COLM 2024 camera-ready
♻ ☆ Collision Avoidance using Iterative Dynamic and Nonlinear Programming with Adaptive Grid Refinements
Nonlinear optimal control problems for trajectory planning with obstacle
avoidance present several challenges. While general-purpose optimizers and
dynamic programming methods struggle when adopted separately, their combination
enabled by a penalty approach is capable of handling highly nonlinear systems
while overcoming the curse of dimensionality. Nevertheless, using dynamic
programming with a fixed state space discretization limits the set of reachable
solutions, hindering convergence or requiring enormous memory resources for
uniformly spaced grids. In this work, we address this issue by incorporating an
adaptive refinement of the state-space grid, splitting cells where needed to
better capture the problem structure while requiring fewer discretization
points overall. Numerical results on a space manipulator demonstrate the improved
robustness and efficiency of the combined method with respect to the single
components.
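The cell-splitting idea can be sketched in one dimension: split a cell whenever the value function varies too much across its endpoints, so resolution concentrates where the problem structure demands it. The `refine` function, the tolerance, and the tanh cost-to-go are illustrative assumptions, not the paper's algorithm.

```python
import math

def refine(cells, value, tol, max_depth=6):
    """Recursively split cells (x0, x1, depth) whose value-function
    variation across the endpoints exceeds `tol`."""
    out = []
    for (x0, x1, depth) in cells:
        if depth < max_depth and abs(value(x1) - value(x0)) > tol:
            mid = 0.5 * (x0 + x1)
            out += refine([(x0, mid, depth + 1), (mid, x1, depth + 1)],
                          value, tol, max_depth)
        else:
            out.append((x0, x1, depth))
    return out

# A cost-to-go that changes sharply near an obstacle at x = 0.5.
v = lambda x: math.tanh(20.0 * (x - 0.5))
grid = refine([(0.0, 1.0, 0)], v, tol=0.2)
widths = [x1 - x0 for (x0, x1, _) in grid]
```

The resulting grid keeps coarse cells where the value function is flat and fine ones near the obstacle, which is the memory saving over a uniformly spaced grid.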
♻ ☆ Reflectance Estimation for Proximity Sensing by Vision-Language Models: Utilizing Distributional Semantics for Low-Level Cognition in Robotics
Large language models (LLMs) and vision-language models (VLMs) have been
increasingly used in robotics for high-level cognition, but their use for
low-level cognition, such as interpreting sensor information, remains
underexplored. In robotic grasping, estimating the reflectance of objects is
crucial, as it significantly affects the distance measured by proximity
sensors. We investigate whether LLMs can estimate
reflectance from object names alone, leveraging the embedded human knowledge in
distributional semantics, and if the latent structure of language in VLMs
positively affects image-based reflectance estimation. In this paper, we verify
that 1) LLMs such as GPT-3.5 and GPT-4 can estimate an object's reflectance
using only text as input; and 2) VLMs such as CLIP can increase their
generalization capabilities in reflectance estimation from images. Our
experiments show that GPT-4 can estimate an object's reflectance using only
text input with a mean error of 14.7%, lower than the image-only ResNet.
Moreover, CLIP achieved the lowest mean error of 11.8%, while GPT-3.5 obtained
a competitive 19.9% compared to ResNet's 17.8%. These results suggest that the
distributional semantics in LLMs and VLMs increase their generalization
capabilities, and that the knowledge acquired by VLMs benefits from the latent
structure of language.
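The comparison metric can be sketched as a mean absolute error in percentage points over per-object reflectance estimates. The object names and numbers below are invented stand-ins for illustration, not the paper's data or real model output.

```python
import numpy as np

def mean_reflectance_error(pred, true):
    """Mean absolute error (percentage points) between estimated and
    measured reflectance -- the kind of score used to rank estimators."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    return float(np.mean(np.abs(pred - true)))

# Hypothetical ground-truth reflectance (%) and text-only estimates,
# standing in for values parsed from an LLM's answer about each object.
objects = ["mirror", "sponge", "aluminum can"]
true_r = [90.0, 10.0, 70.0]
llm_r = [85.0, 20.0, 60.0]  # stand-in predictions, not real GPT-4 output
err = mean_reflectance_error(llm_r, true_r)  # (5 + 10 + 10) / 3
```

Under such a metric, a lower score means the estimator's distributional knowledge maps object names (or images) more faithfully to physical reflectance.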
comment: 24 pages, 13 figures, submitted to Advanced Robotics Special Issue on
Real-World Robot Applications of the Foundation Models